As the name implies, we want to reduce a set of several variables into a few variables. In this session, we will practice two basic techniques:
Cluster analysis.
Factor analysis.
Simply speaking, cluster analysis organizes the cases (rows) into a small set of groups, based on the information (the columns) available for each case. The technique creates a new variable, which will be of categorical type, generally ordinal.
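As a minimal sketch of the idea (on hypothetical toy data, not our dataset), base R can already produce such a categorical grouping:

```r
# toy sketch (hypothetical data): six cases measured on two variables
toy = data.frame(a = c(1, 1.2, 0.9, 5, 5.1, 4.8),
                 b = c(2, 2.1, 1.9, 7, 7.2, 6.9))
# hierarchical clustering on the pairwise distances, cut into 2 groups
groups = cutree(hclust(dist(toy)), k = 2)
# the result is the new categorical variable announced above
as.factor(groups)
```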
Let me bring back the data we prepared in Python:
# clean memory
rm(list = ls())
# link to data file
link='https://github.com/EvansDataScience/CTforGA_integrating/raw/main/demo_fragile.RDS'
# a RDS file from the web needs:
myFile=url(link)
# reading in the data:
fromPy=readRDS(file = myFile)
# reset indexes to R paradigm:
row.names(fromPy)=NULL
# check data types
str(fromPy)
## 'data.frame': 165 obs. of 21 variables:
## $ Country : chr "Norway" "New Zealand" "Finland" "Sweden" ...
## $ Regimetype : Ord.factor w/ 4 levels "Authoritarian"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Overallscore : num 9.75 9.37 9.27 9.26 9.18 9.09 9 8.9 8.9 8.88 ...
## $ Electoralprocessandpluralism: num 10 10 10 9.58 10 10 10 10 9.58 9.58 ...
## $ Functioningofgovernment : num 9.64 8.93 9.29 9.29 8.21 8.93 7.86 8.57 8.93 8.93 ...
## $ Politicalparticipation : num 10 9.44 8.89 8.33 8.89 8.33 8.33 7.78 7.78 8.33 ...
## $ Politicalculture : num 10 8.75 8.75 10 9.38 9.38 9.38 8.75 9.38 8.75 ...
## $ Civilliberties : num 9.12 9.71 9.41 9.12 9.41 8.82 9.41 9.41 8.82 8.82 ...
## $ Total : num 0.0419 0.2304 0 0.5445 0.1885 ...
## $ C1SecurityApparatus : num 1.8 1.4 2.5 2.7 0.7 1.7 2.7 2.7 1.6 2.2 ...
## $ C2FactionalizedElites : num 1.1 1.4 1.4 1.8 1.8 1.4 1.5 1.7 1 3.4 ...
## $ C3GroupGrievance : num 3.3 2.6 0.6 1.7 0.5 3.7 0.5 3.1 2.7 3.6 ...
## $ E1Economy : num 1.9 3.4 2.9 1.9 3.4 1.7 2.7 1.6 2 2.2 ...
## $ E2EconomicInequality : num 1 2.1 1 1.7 1.3 1.2 1.6 1.8 1.8 1.6 ...
## $ E3HumanFlightandBrainDrain : num 0.8 1.6 1.5 0.7 1.9 1.3 2.8 0.5 1.1 2.4 ...
## $ P1StateLegitimacy : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ P2PublicServices : num 1.6 1.4 1.6 1.4 1.2 1.4 2.5 2.8 1.6 1.3 ...
## $ P3HumanRights : num 0.5 0.5 0.5 0.9 0.5 0.9 1.6 1.7 0.8 0.7 ...
## $ S1DemographicPressures : num 1.4 1.4 1.7 3.3 1.5 2.3 2.8 2.9 3.2 3.1 ...
## $ S2RefugeesandIDPs : num 2.2 1.6 1.5 4.3 1.5 2.2 1.4 2 3.1 2.6 ...
## $ X1ExternalIntervention : num 0.5 0.5 0.5 0.5 3.2 0.5 1.6 0.5 0.5 0.5 ...
We have several countries and several columns. In clustering, we try to create groups so that we have the highest homogeneity within groups and the highest heterogeneity between groups. The variables serve to compute a distance between the cases, which the clustering algorithms use to detect the homogeneity and heterogeneity just mentioned:
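To make "distance among the cases" concrete, here is a sketch with two hypothetical cases (we will use the Gower metric later; plain Euclidean distance suffices to illustrate):

```r
# two hypothetical cases described by three columns each
case1 = c(9.6, 10, 9.1)
case2 = c(2.1, 1.4, 3.3)
# Euclidean distance by hand: square root of summed squared differences
sqrt(sum((case1 - case2)^2))
# the same, via base R's dist()
dist(rbind(case1, case2))
```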
# select variables
dataToCluster=fromPy[,-c(1:3,9)]
#save the country names as the row index
row.names(dataToCluster)=fromPy$Country
summary(dataToCluster)
## Electoralprocessandpluralism Functioningofgovernment Politicalparticipation
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 1.420 1st Qu.:2.500 1st Qu.: 3.890
## Median : 7.000 Median :5.000 Median : 5.560
## Mean : 5.619 Mean :4.616 Mean : 5.377
## 3rd Qu.: 9.170 3rd Qu.:6.430 3rd Qu.: 6.670
## Max. :10.000 Max. :9.640 Max. :10.000
## Politicalculture Civilliberties C1SecurityApparatus C2FactionalizedElites
## Min. : 1.250 Min. :0.000 Min. : 0.500 Min. : 1.000
## 1st Qu.: 3.750 1st Qu.:3.240 1st Qu.: 3.500 1st Qu.: 5.100
## Median : 5.000 Median :5.590 Median : 5.600 Median : 7.300
## Mean : 5.351 Mean :5.326 Mean : 5.294 Mean : 6.601
## 3rd Qu.: 6.250 3rd Qu.:7.650 3rd Qu.: 6.900 3rd Qu.: 8.400
## Max. :10.000 Max. :9.710 Max. :10.000 Max. :10.000
## C3GroupGrievance E1Economy E2EconomicInequality
## Min. :0.500 Min. :1.200 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.400 1st Qu.:3.600
## Median :5.800 Median :5.900 Median :5.200
## Mean :5.758 Mean :5.778 Mean :5.255
## 3rd Qu.:7.500 3rd Qu.:7.100 3rd Qu.:7.100
## Max. :9.900 Max. :9.800 Max. :9.600
## E3HumanFlightandBrainDrain P1StateLegitimacy P2PublicServices P3HumanRights
## Min. :0.500 Min. : 0.500 Min. : 1.200 Min. :0.500
## 1st Qu.:3.800 1st Qu.: 3.700 1st Qu.: 3.700 1st Qu.:3.400
## Median :5.500 Median : 6.400 Median : 5.400 Median :6.000
## Mean :5.212 Mean : 5.817 Mean : 5.719 Mean :5.447
## 3rd Qu.:6.800 3rd Qu.: 8.200 3rd Qu.: 8.000 3rd Qu.:7.400
## Max. :9.000 Max. :10.000 Max. :10.000 Max. :9.800
## S1DemographicPressures S2RefugeesandIDPs X1ExternalIntervention
## Min. :1.400 Min. : 0.700 Min. : 0.500
## 1st Qu.:4.100 1st Qu.: 2.800 1st Qu.: 3.400
## Median :5.700 Median : 4.300 Median : 5.400
## Mean :5.901 Mean : 4.802 Mean : 5.124
## 3rd Qu.:8.000 3rd Qu.: 6.600 3rd Qu.: 6.900
## Max. :9.800 Max. :10.000 Max. :10.000
boxplot(dataToCluster,horizontal = T, las=2,cex.axis=0.4)
The data values have a similar range, so you do not need to transform the data. Possible alternatives could have been:
library(BBmisc)
#### standardizing
normalize(dataToCluster,method='standardize')
### normalizing
normalize(dataToCluster,method='range',range=c(0,10))
set.seed(999) # for reproducibility
library(cluster)
# compute pairwise distances with the Gower metric (handles mixed variable types)
distanceMatrix=daisy(x=dataToCluster, metric = "gower")
# project the cases onto 2 dimensions via classical multidimensional scaling
projectedData = cmdscale(distanceMatrix, k=2)
The object projectedData stores two coordinates for each case in the data:
# save coordinates to original data frame:
fromPy$dim1 = projectedData[,1]
fromPy$dim2 = projectedData[,2]
# see some:
fromPy[,c('dim1','dim2')][1:10,]
## dim1 dim2
## 1 -0.4637066 -0.017476271
## 2 -0.4374335 -0.016755363
## 3 -0.4492530 -0.017650599
## 4 -0.4115496 0.002026328
## 5 -0.4318327 -0.012772446
## 6 -0.4280394 -0.017597666
## 7 -0.3980438 -0.013114093
## 8 -0.4004756 -0.016744500
## 9 -0.4125417 -0.010853684
## 10 -0.3835142 -0.011989627
library(ggplot2)
base= ggplot(data=fromPy,
aes(x=dim1, y=dim2,
label=Country))
base + geom_text(size=2)
Can you see some groups emerging?
An alternative is the dendrogram:
# prepare hierarchical cluster
hc = hclust(distanceMatrix)
# very simple dendrogram
plot(hc,hang = -1,cex=0.5)
There are several techniques for clustering, each with pros and cons. We will review hierarchical clustering here. As the name implies, it builds a full hierarchy of nested clusters, so it is up to us to identify how many clusters emerge; the dendrogram helps with that decision. The hierarchical approach has two different strategies. The agglomerative approach starts by considering each case (row) a cluster, and then uses the distances to link the closest cases; not all cases are linked during the first step, the next step merges more clusters, and so on. The linking process can also vary (we will use Ward linkage). The divisive approach, on the other hand, starts by considering all the cases as one cluster and then splits them.
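The two strategies can be contrasted on hypothetical toy data (both functions come from the cluster package loaded earlier):

```r
library(cluster)
# hypothetical data: four cases forming two obvious groups
d = dist(matrix(c(1,   1,
                  1.2, 1.1,
                  5,   5,
                  5.2, 5.1), ncol = 2, byrow = TRUE))
# agglomerative (bottom-up) with Ward linkage
res_agn = agnes(d, method = "ward")
# divisive (top-down)
res_dia = diana(d)
# both return a hierarchy that can be cut into k clusters
cutree(as.hclust(res_agn), k = 2)
```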
Using the function fviz_nbclust from the library factoextra, we can see how many clusters are suggested.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(dataToCluster,
hcut,
diss=distanceMatrix,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "agnes")
fviz_nbclust(dataToCluster,
hcut,
diss=distanceMatrix,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "diana")
We could accept the suggested number of clusters or not. Let’s use the suggestion:
Here you can compute the clusters. I am using both the agglomerative and the divisive approaches. Notice that you have to indicate the number of clusters required a priori.
NumberOfClusterDesired=4
#library(factoextra)
res.agnes= hcut(distanceMatrix,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='agnes',
hc_method = "ward.D2")
# Hierarchical technique- divisive approach
res.diana= hcut(distanceMatrix,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='diana',
hc_method = "ward.D2")
3.1 Save results to original data frame:
fromPy$agn=as.factor(res.agnes$cluster)
fromPy$dia=as.factor(res.diana$cluster)
3.2 Verify ordinality in clusters: Each cluster has a label. If the result represents some ordering, the labels do not necessarily reflect it. Below, I use the column Overallscore to get an idea of the real ordering that the labels represent.
aggregate(data=fromPy,
Overallscore~agn,
FUN=mean)
## agn Overallscore
## 1 1 8.082250
## 2 2 6.272222
## 3 3 3.503220
## 4 4 2.627143
aggregate(data=fromPy,
Overallscore~dia,
FUN=mean)
## dia Overallscore
## 1 1 8.309118
## 2 2 6.553333
## 3 3 5.128333
## 4 4 2.843019
You could recode these values so that the labels represent an ascending order.
library(dplyr)
fromPy$agn=dplyr::recode_factor(fromPy$agn,
`1` = '4',`2`='3',`3`='2',`4`='1')
fromPy$dia=dplyr::recode_factor(fromPy$dia,
`1` = '4',`2`='3',`3`='2',`4`='1')
The hierarchical clustering process returns the silhouette information for each observation, a measure of how well a case has been classified. Silhouette widths range from -1 to +1: the higher the positive value, the better classified the case is. Low positive values indicate poor clustering, and negative values indicate wrongly clustered cases.
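For case i, the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the case's own cluster and b(i) the mean distance to the nearest other cluster. A sketch on hypothetical data and labels:

```r
library(cluster)
# hypothetical data: four cases, labels assigning two well-separated clusters
toy = matrix(c(1,   1,
               1.1, 1.2,
               5,   5,
               5.1, 4.9), ncol = 2, byrow = TRUE)
sil = silhouette(c(1, 1, 2, 2), dist(toy))
# one width per case, always in [-1, 1]; here all are high and positive
sil[, "sil_width"]
```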
4.1 Plot silhouettes
fviz_silhouette(res.agnes)
## cluster size ave.sil.width
## 1 1 40 0.42
## 2 2 45 0.25
## 3 3 59 0.19
## 4 4 21 0.34
library(factoextra)
fviz_silhouette(res.diana)
## cluster size ave.sil.width
## 1 1 34 0.36
## 2 2 24 0.14
## 3 3 54 0.13
## 4 4 53 0.27
4.2 Detecting cases wrongly clustered
Previous results have saved important information. Let me keep the negative silhouette values:
agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)
agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])
Now I can see which countries are not well clustered:
library("qpcR")
bad_Clus=as.data.frame(qpcR:::cbind.na(sort(agnPoor),
sort(diaPoor)))
names(bad_Clus)=c("agn","dia")
bad_Clus
## agn dia
## 1 Algeria Guyana
## 2 Bangladesh Italy
## 3 Chile North Macedonia
## 4 Croatia Sierra Leone
## 5 Cyprus Singapore
## 6 Equatorial Guinea <NA>
## 7 Gabon <NA>
## 8 Hungary <NA>
## 9 Jordan <NA>
## 10 Kyrgyzstan <NA>
## 11 Lesotho <NA>
## 12 Madagascar <NA>
## 13 Morocco <NA>
## 14 Papua New Guinea <NA>
## 15 Sri Lanka <NA>
## 16 Turkey <NA>
Color the plots by cluster membership:
base= ggplot(data=fromPy,
aes(x=dim1, y=dim2,
label=Country))
- For Hierarchical AGNES:
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
aes(color=agn),
show.legend = T)
- For Hierarchical DIANA:
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
aes(color=dia),
show.legend = T)
library(ggpubr)
ggarrange(agnPlot, diaPlot,ncol = 2,common.legend = T)
Annotating the poorly clustered cases:
# If name of country in black list, use it, else get rid of it
LABELdia=ifelse(fromPy$Country%in%diaPoor,fromPy$Country,"")
LABELagn=ifelse(fromPy$Country%in%agnPoor,fromPy$Country,"")
- Prepare plots with the labels of those cases:
library(ggrepel)
diaPlot=diaPlot +
geom_text_repel(aes(label=LABELdia),
max.overlaps=50,
min.segment.length =unit(0,'lines'))
agnPlot= agnPlot +
geom_text_repel(aes(label=LABELagn),
max.overlaps = 50,
min.segment.length = unit(0, 'lines'))
- Plot and compare:
ggarrange(agnPlot,
diaPlot,
ncol = 2,
common.legend = T)
fviz_dend(res.agnes,
k=NumberOfClusterDesired,
cex = 0.45,
horiz = T,
main = "AGNES approach")
fviz_dend(res.diana,
k=NumberOfClusterDesired,
cex = 0.45,
horiz = T,
main = "DIANA approach")
Let’s compare these clusters with the levels proposed by The Economist:
table(fromPy$Regimetype,fromPy$agn)
##
## 4 3 2 1
## Authoritarian 0 0 38 21
## Hybrid regime 0 15 18 0
## Flawed democracy 20 30 3 0
## Full democracy 20 0 0 0
table(fromPy$Regimetype,fromPy$dia)
##
## 4 3 2 1
## Authoritarian 0 2 13 44
## Hybrid regime 0 0 24 9
## Flawed democracy 14 22 17 0
## Full democracy 20 0 0 0
This way, you can use the clustering results to validate other classifications done theoretically or by simpler techniques (e.g., averaging).
Simply speaking, this technique tries to express in one (or a few) dimension(s) the behavior of several others. FA assumes that the several input variables have ‘something’ in common: something latent that the set of input variables represents.
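The idea of a latent variable can be sketched with simulated data (hypothetical values): two observed variables correlate because both are driven by the same unobserved factor.

```r
set.seed(123)
# one latent (unobserved) factor drives two observed indicators
latent = rnorm(200)
x1 = 0.8 * latent + rnorm(200, sd = 0.4)
x2 = 0.9 * latent + rnorm(200, sd = 0.4)
# the shared latent factor induces a strong correlation between x1 and x2
cor(x1, x2)
```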
Let’s follow these steps:
Our dataForFA will have the same data we used for clustering.
dataForFA=dataToCluster
We know that we have two sets of variables, one related to democracy and the other to fragility, all of them computed at the country level.
In factor analysis we need a measure of similarity among variables (not cases). Let me compute the heterogeneous correlation matrix (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables).
library(polycor)
corMatrix=polycor::hetcor(dataForFA)$correlations
We can visualize this matrix:
library(ggcorrplot)
ggcorrplot(corMatrix,
type = "lower") +
theme(axis.text.x = element_text(size = 5),
axis.text.y = element_text(size = 5))
You should notice that the variables belonging to the same concept are correlated among one another, while variables from different concepts should have a low correlation.
3.1 The amount of data should be enough for the correlation process:
library(psych)
psych::KMO(corMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA = 0.94
## MSA for each item =
## Electoralprocessandpluralism Functioningofgovernment
## 0.92 0.96
## Politicalparticipation Politicalculture
## 0.94 0.96
## Civilliberties C1SecurityApparatus
## 0.92 0.95
## C2FactionalizedElites C3GroupGrievance
## 0.94 0.93
## E1Economy E2EconomicInequality
## 0.97 0.94
## E3HumanFlightandBrainDrain P1StateLegitimacy
## 0.96 0.94
## P2PublicServices P3HumanRights
## 0.91 0.94
## S1DemographicPressures S2RefugeesandIDPs
## 0.94 0.94
## X1ExternalIntervention
## 0.94
3.2 The correlation matrix should not be an identity matrix:
# is identity matrix?
cortest.bartlett(corMatrix,n=nrow(dataForFA))$p.value>0.05
## [1] FALSE
3.3 The correlation matrix should not be singular:
library(matrixcalc)
is.singular.matrix(corMatrix)
## [1] TRUE
If some of these conditions fail, you should not expect a reliable result; however, you may continue in order to see the sources of the flaws.
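Singularity typically means some variable is (nearly) a linear combination of the others; a sketch with hypothetical data:

```r
# hypothetical data: y is perfectly collinear with x
set.seed(42)
x = rnorm(100)
y = 2 * x
# their correlation matrix is singular: its determinant is (numerically) zero
det(cor(cbind(x, y)))
```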
fa.parallel(dataForFA, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
library(GPArotation)
resfa <- fa(dataForFA,
nfactors = 2,
cor = 'mixed',
rotate = "varimax",
fm="minres")
### see results
print(resfa$loadings)
##
## Loadings:
## MR1 MR2
## Electoralprocessandpluralism -0.215 -0.885
## Functioningofgovernment -0.516 -0.744
## Politicalparticipation -0.294 -0.743
## Politicalculture -0.427 -0.570
## Civilliberties -0.284 -0.930
## C1SecurityApparatus 0.735 0.465
## C2FactionalizedElites 0.584 0.645
## C3GroupGrievance 0.407 0.487
## E1Economy 0.799 0.304
## E2EconomicInequality 0.795 0.380
## E3HumanFlightandBrainDrain 0.780 0.246
## P1StateLegitimacy 0.479 0.820
## P2PublicServices 0.885 0.330
## P3HumanRights 0.466 0.800
## S1DemographicPressures 0.841 0.305
## S2RefugeesandIDPs 0.685 0.384
## X1ExternalIntervention 0.714 0.424
##
## MR1 MR2
## SS loadings 6.503 6.090
## Proportion Var 0.383 0.358
## Cumulative Var 0.383 0.741
You can see this more clearly using a cutoff:
print(resfa$loadings,cutoff = 0.5)
##
## Loadings:
## MR1 MR2
## Electoralprocessandpluralism -0.885
## Functioningofgovernment -0.516 -0.744
## Politicalparticipation -0.743
## Politicalculture -0.570
## Civilliberties -0.930
## C1SecurityApparatus 0.735
## C2FactionalizedElites 0.584 0.645
## C3GroupGrievance
## E1Economy 0.799
## E2EconomicInequality 0.795
## E3HumanFlightandBrainDrain 0.780
## P1StateLegitimacy 0.820
## P2PublicServices 0.885
## P3HumanRights 0.800
## S1DemographicPressures 0.841
## S2RefugeesandIDPs 0.685
## X1ExternalIntervention 0.714
##
## MR1 MR2
## SS loadings 6.503 6.090
## Proportion Var 0.383 0.358
## Cumulative Var 0.383 0.741
The previous results serve to indicate whether variables group in a meaningful way. In our example, you want to know if the indicators in each set are grouped together. These results can also be visualized:
fa.diagram(resfa,main = "EFA results")
7.1 How are indicators contributing to the process?
# The higher the better
sort(resfa$communality)
## C3GroupGrievance Politicalculture
## 0.4034923 0.5068953
## S2RefugeesandIDPs Politicalparticipation
## 0.6161475 0.6393868
## E3HumanFlightandBrainDrain X1ExternalIntervention
## 0.6690640 0.6900314
## E1Economy C1SecurityApparatus
## 0.7306222 0.7567213
## C2FactionalizedElites E2EconomicInequality
## 0.7571858 0.7754234
## S1DemographicPressures Functioningofgovernment
## 0.8000285 0.8202032
## Electoralprocessandpluralism P3HumanRights
## 0.8298193 0.8575290
## P2PublicServices P1StateLegitimacy
## 0.8920841 0.9021443
## Civilliberties
## 0.9462528
7.2 What indicators are loading in more than one factor?
# closer to 2, 3, etc?
# closer to zero?
sort(resfa$complexity)
## Electoralprocessandpluralism Civilliberties
## 1.118078 1.184325
## E3HumanFlightandBrainDrain S1DemographicPressures
## 1.197527 1.258663
## P2PublicServices E1Economy
## 1.273266 1.283032
## Politicalparticipation E2EconomicInequality
## 1.306003 1.433648
## S2RefugeesandIDPs P3HumanRights
## 1.572465 1.607907
## P1StateLegitimacy X1ExternalIntervention
## 1.611564 1.628395
## C1SecurityApparatus Functioningofgovernment
## 1.691080 1.782136
## Politicalculture C3GroupGrievance
## 1.853424 1.939266
## C2FactionalizedElites
## 1.980759
9.1 Let’s remove some variables. These belong to ‘fragility’ but may be close to ‘democracy’:
ps=c("P1StateLegitimacy","P2PublicServices","P3HumanRights")
notPs=setdiff(names(dataForFA),ps)
dataForFA2=dataForFA[,notPs]
9.2 Recompute correlations
library(polycor)
corMatrix2=polycor::hetcor(dataForFA2)$correlations
9.3 Recheck conditions:
library(psych)
psych::KMO(corMatrix2)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix2)
## Overall MSA = 0.93
## MSA for each item =
## Electoralprocessandpluralism Functioningofgovernment
## 0.88 0.96
## Politicalparticipation Politicalculture
## 0.93 0.94
## Civilliberties C1SecurityApparatus
## 0.89 0.97
## C2FactionalizedElites C3GroupGrievance
## 0.96 0.90
## E1Economy E2EconomicInequality
## 0.96 0.90
## E3HumanFlightandBrainDrain S1DemographicPressures
## 0.95 0.89
## S2RefugeesandIDPs X1ExternalIntervention
## 0.93 0.93
cortest.bartlett(corMatrix2,n=nrow(dataForFA2))$p.value>0.05
## [1] FALSE
library(matrixcalc)
is.singular.matrix(corMatrix2)
## [1] FALSE
9.4 Get suggestions
fa.parallel(dataForFA2, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
9.5 Compute factors
library(GPArotation)
resfa <- fa(dataForFA2,
nfactors = 2,
cor = 'mixed',
rotate = "varimax",
fm="minres")
9.6 Explore results
Here, numerically
print(resfa$loadings,cutoff = 0.5)
##
## Loadings:
## MR1 MR2
## Electoralprocessandpluralism 0.905
## Functioningofgovernment -0.549 0.726
## Politicalparticipation 0.749
## Politicalculture 0.530
## Civilliberties 0.932
## C1SecurityApparatus 0.767
## C2FactionalizedElites 0.645 -0.578
## C3GroupGrievance
## E1Economy 0.812
## E2EconomicInequality 0.767
## E3HumanFlightandBrainDrain 0.791
## S1DemographicPressures 0.817
## S2RefugeesandIDPs 0.710
## X1ExternalIntervention 0.755
##
## MR1 MR2
## SS loadings 5.601 4.369
## Proportion Var 0.400 0.312
## Cumulative Var 0.400 0.712
Visually:
fa.diagram(resfa,main = "EFA results (2)")
9.7 Analyse new results
sort(resfa$communality)
## C3GroupGrievance Politicalculture
## 0.4008682 0.5022138
## S2RefugeesandIDPs Politicalparticipation
## 0.6331073 0.6544139
## E3HumanFlightandBrainDrain X1ExternalIntervention
## 0.6723772 0.7176851
## E2EconomicInequality E1Economy
## 0.7265745 0.7338534
## C2FactionalizedElites S1DemographicPressures
## 0.7506536 0.7516278
## C1SecurityApparatus Functioningofgovernment
## 0.7566777 0.8290041
## Electoralprocessandpluralism Civilliberties
## 0.8723964 0.9676060
sort(resfa$complexity)
## Electoralprocessandpluralism E3HumanFlightandBrainDrain
## 1.129833 1.147993
## E1Economy Civilliberties
## 1.224932 1.226998
## S1DemographicPressures Politicalparticipation
## 1.247477 1.323574
## E2EconomicInequality S2RefugeesandIDPs
## 1.445441 1.480002
## X1ExternalIntervention C1SecurityApparatus
## 1.484753 1.529322
## Functioningofgovernment Politicalculture
## 1.862299 1.972808
## C2FactionalizedElites C3GroupGrievance
## 1.976024 1.994779
It is not that we have the perfect solution, but you eventually need to stop.
Each factor is represented by a score:
summary(resfa$scores)
## MR1 MR2
## Min. :-2.3475 Min. :-2.2907
## 1st Qu.:-0.7799 1st Qu.:-0.8327
## Median : 0.1728 Median : 0.2267
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7639 3rd Qu.: 0.8467
## Max. : 1.6656 Max. : 1.6075
As you see, the scores are not in the range from zero to ten; let’s rescale them:
library(BBmisc)
efa_scores=normalize(resfa$scores,
method = "range",
margin=2, # by column
range = c(0, 10))
# save the scores in the data
fromPy$Fragile_efa=efa_scores[,1]
fromPy$Demo_efa=efa_scores[,2]
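The 'range' method above is just min-max rescaling; the same operation can be sketched in base R (rescale_range is a hypothetical helper name):

```r
# hypothetical helper: min-max rescaling of a vector onto [lo, hi]
rescale_range = function(x, lo = 0, hi = 10) {
  lo + (x - min(x)) * (hi - lo) / (max(x) - min(x))
}
rescale_range(c(2, 4, 6))  # smallest value maps to 0, largest to 10
```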
You do not normally have a means to validate the scores, but our example data has pre-computed scores. Let’s use those to see if our scores make sense.
library("ggpubr")
ggscatter(data=fromPy, x = "Overallscore", y = "Demo_efa",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "DemoIndex (original)", ylab = "DemoIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'
ggscatter(data=fromPy, x = "Total", y = "Fragile_efa",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "FragileIndex (original)", ylab = "FragileIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'